Cluster compression for compressing weights in neural networks

ABSTRACT

A method for instantiating a convolutional neural network on a computing system. The convolutional neural network includes a plurality of layers, and instantiating the convolutional neural network includes training the convolutional neural network using a first loss function until a first classification accuracy is reached, clustering a set of F×K kernels of the first layer into a set of C clusters, training the convolutional neural network using a second loss function until a second classification accuracy is reached, creating a dictionary which maps each of a number of centroids to a corresponding centroid identifier, quantizing and compressing F filters of the first layer, storing F quantized and compressed filters of the first layer in a memory of the computing system, storing F biases of the first layer in the memory, and classifying data received by the convolutional neural network.

RELATED APPLICATIONS

This application is a Continuation Application of U.S. application Ser. No. 16/273,592, filed on 12 Feb. 2019, which is a non-provisional patent application of and claims priority to U.S. Provisional Application No. 62/642,578, filed 13 Mar. 2018 and U.S. Provisional Application No. 62/672,845, filed 17 May 2018, all of which are incorporated by reference herein.

FIELD OF THE INVENTION

The present invention relates to a neural network, and more particularly relates to quantizing and compressing the parameters of a neural network by using clustering techniques.

BACKGROUND

Today, neural networks (in particular convolutional neural networks) are widely used for performing image recognition/classification (e.g., recognizing an entire image to be of the class “dog”), object recognition/classification (e.g., recognizing that in an image there is a “dog” at a particular position with a particular size) and image segmentation (e.g., identifying certain pixels in an image that belong to an object in the class of “dog”). While having numerous applications (e.g., object identification for self-driving cars, facial recognition for social networks, etc.), neural networks require intensive computational processing and frequent memory accesses. Described herein are techniques for reducing the quantity of data accessed from a memory when loading the parameters of a neural network into a compute engine.

SUMMARY OF THE INVENTION

In one embodiment, the present invention provides a method for instantiating a convolutional neural network on a computing system. The convolutional neural network includes a plurality of layers, and instantiating the convolutional neural network includes training the convolutional neural network using a first loss function until a first classification accuracy is reached, clustering a set of F×K kernels of the first layer into a set of C clusters, training the convolutional neural network using a second loss function until a second classification accuracy is reached, creating a dictionary which maps each of a number of centroids to a corresponding centroid identifier, quantizing and compressing F filters of the first layer, storing F quantized and compressed filters of the first layer in a memory of the computing system, storing F biases of the first layer in the memory, and classifying data received by the convolutional neural network.

The first loss function calculates a classification error of the convolutional neural network. Training the convolutional neural network with the first loss function includes optimizing, for a first one of the layers, a first set of F filters and a first set of F biases so as to minimize the first loss function, wherein each of the F filters is formed from K kernels, wherein each of the kernels has N dimensions, and wherein each of the biases is scalar.

Each of the clusters is characterized by a centroid; thus, the C clusters are characterized by C centroids. Each of the centroids has N dimensions, and C is less than F×K.

The second loss function is a linear combination of the classification error and a clustering error. The clustering error aggregates a deviation between each of the kernels and one of the C centroids nearest to the respective kernel. Training the convolutional neural network with the second loss function includes optimizing the F filters and the F biases so as to minimize the second loss function.

The quantizing and compressing of the F filters of the first layer includes, for each of the F×K kernels, replacing the kernel with a centroid identifier that identifies a centroid that is closest to the kernel. The F quantized and compressed filters include F×K centroid identifiers.

The classification includes retrieving the F quantized and compressed filters of the first layer from memory; decompressing, using the dictionary, the F quantized and compressed filters of the first layer into F quantized filters by mapping the F×K centroid identifiers into F×K corresponding quantized kernels, the F×K corresponding quantized kernels forming the F quantized filters; retrieving the F biases of the first layer from memory; and for the first layer, computing a convolution of the received data or data output from a layer previous to the first layer with the F quantized filters and the F biases.

In some embodiments, the present method also includes, during the instantiation of the convolutional neural network, further storing the dictionary in the memory; and during the classification of the received data, further retrieving the dictionary from the memory.

In various embodiments, C may be equal to 2^n, where n is a natural number; the centroid identifier may be expressed using n bits; the F×K kernels may be clustered into the C clusters using the k-means algorithm; and/or each of the F×K kernels may be represented by a √N×√N matrix of parameters.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a diagram providing an overview of model training and model application in a neural network.

FIG. 2 depicts a diagram of the input, model parameters and output of a convolution operation, the model parameters including a single 2-dimensional filter.

FIG. 3 depicts a diagram that explains the computation of a convolution operation using a 2-dimensional filter.

FIG. 4 depicts a diagram of the input, model parameters and output of a convolution operation, the model parameters including a plurality of 2-dimensional filters.

FIG. 5 depicts a diagram of the input, model parameters and output of a convolution operation, the model parameters including a single 3-dimensional filter.

FIG. 6 depicts a diagram that explains the computation of a convolution operation using a 3-dimensional filter.

FIG. 7 depicts a diagram of the input, model parameters and output of a convolution operation, the model parameters including a plurality of 3-dimensional filters.

FIG. 8 depicts a computing system within which a neural network is instantiated.

FIG. 9 depicts a convolutional neural network including a plurality of layers, in accordance with one embodiment of the invention.

FIG. 10 depicts a diagram in which F filters are constructed using F×K kernels, and a visualization of the F×K kernels as points in an N-dimensional space, in accordance with one embodiment of the invention.

FIG. 11 depicts a diagram in which the F×K kernels are clustered into C clusters, and each of the C clusters is characterized by a centroid, in accordance with one embodiment of the invention.

FIG. 12 depicts a diagram to illustrate a joint optimization process performed on the F×K kernels, in accordance with one embodiment of the invention.

FIG. 13A depicts a diagram to illustrate the quantization and compression of the F filters during the training of the convolutional neural network, in accordance with one embodiment of the invention.

FIG. 13B depicts a diagram to illustrate the quantization and compression of the F filters during the training of the convolutional neural network, in accordance with one embodiment of the invention.

FIG. 14 depicts a dictionary used during the compression and de-compression of the F filters, in accordance with one embodiment of the invention.

FIG. 15 depicts a diagram to illustrate the de-compression of the F filters during the application of the convolutional neural network, in accordance with one embodiment of the invention.

FIG. 16 depicts a flowchart with an overview of the model training and model application of a convolutional neural network, in accordance with one embodiment of the invention.

FIG. 17 depicts a flowchart of a process performed during the training of a convolutional neural network, in accordance with one embodiment of the invention.

FIG. 18 depicts a flowchart of a process performed during the application of a convolutional neural network, in accordance with one embodiment of the invention.

FIG. 19 depicts components of a computer system in which computer readable instructions instantiating the methods of the present invention may be stored and executed.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention. Description associated with any one of the figures may be applied to a different figure containing like or similar components/steps.

Multi-layer machine learning models (e.g., neural networks), especially those used for computer vision, have tens or hundreds of millions of parameters. The size of such a model can easily be as big as tens or hundreds of megabytes. There are multiple possible reasons why it would be advantageous to reduce the size of such a model. One of the possible reasons is to minimize the model size so that the parameters can be stored using smaller memories, such as a static random access memory (SRAM) on the same chip as the convolver/convolution engine, as opposed to a memory external to the convolver/convolution engine (e.g., dynamic random access memory (DRAM) or Flash memory). More importantly, reducing the number of memory accesses when loading all weights of a particular layer into a computation engine of a neural network is crucial for low-power applications.

FIG. 1 depicts a diagram providing an overview of the training phase and the inference phase in a neural network. In the training phase, pairs of input and known (or desired) output may be provided to train model parameters (also called “weights”) of classification model 104. For conciseness, only one input and output pair (102, 106) is depicted in FIG. 1, but in practice many known input and output pairs will be used to train classification model 104. In the example of FIG. 1, input 102 is a matrix of numbers (which may represent the pixels of an image) and known output 106 is a vector of classification probabilities (e.g., the probability that the input image is a cat is 1, the probability that the input image is a dog is 0, and the probability that the input image is a human is 0). In one possible training process, the classification probabilities may be provided by a human (e.g., a human can recognize that the input image depicts a cat and assign the classification probabilities accordingly). At the conclusion of the model training process, the model parameters will have been estimated (e.g., W1=1.2, W2=3.8, W3=2.7). Sometimes, there may be intuitive ways to interpret the model parameters, but many times no intuition may be associated therewith, and the model parameters may simply be the parameters that minimize the error between the model's classification of a given set of input and the known classification (while at the same time avoiding “model overfitting”).

In the inference (or prediction or feed-forward) phase, model 104 with trained parameters (i.e., parameters trained during the training phase) is used to classify a set of input. In the instant application, the trained classification model 104 provides the classification output 110 of a vector of probabilities (e.g., the probability that the input image is a cat is 0.3, the probability that the input image is a dog is 0.6, and the probability that the input image is a human is 0.1) in response to input 108.

One embodiment of classification model 104 is a convolutional neural network, which will be described in detail in FIG. 9. A basic building block of a convolutional neural network is a convolution operation, which is described in FIGS. 2-7. As further described below, a convolution operation may refer to a 2-dimensional convolution operation with 2-dimensional input and a 2-dimensional filter, a 3-dimensional convolution operation with 3-dimensional input and a 3-dimensional filter, etc. Other building blocks of a convolutional neural network may include a long short-term memory (LSTM) unit (not described herein for conciseness).

FIG. 2 depicts a diagram of the input, model parameters and output of a convolution operation, the model parameters including a single 2-dimensional filter. In the example of FIG. 2, the input includes a 2-dimensional matrix of numerical values (each of the numerical values abstractly represented by “•”). The matrix in the example of FIG. 2 is a 4×4 matrix, but other input could have different dimensions (e.g., could be a 100×100 square matrix, a 20×70 rectangular matrix, etc.). Later-presented examples will illustrate that the input may even be a 3-dimensional object. In fact, the input may be an object of any number of dimensions. The input may represent pixel values of an image or may represent the output of a previous convolution operation.

The model parameters may include a filter and a bias. In the example of FIG. 2, the filter is a 2×2 matrix of values and the bias is a scalar value. Later-presented examples will illustrate that the dimensions of the filter may vary (e.g., could be a 3×3 square matrix) and further that the number of dimensions may be greater than 2 (e.g., could be a 3×3×256 object). Typically, there is one bias associated with each filter. The example in FIG. 2 includes one filter, so there is one corresponding bias. However, in certain embodiments, if there were 5 filters, there would be 5 associated biases.

The convolution operator 208 (abbreviated “conv”) receives input 202 and the model parameters 204, 206 and generates output 210 called an activation map or a feature map. Each value of the activation map is generated as a dot product of input 202 and filter 204 (at a certain spatial location relative to input 202) summed with the bias 206. The computations to arrive at activation map 210 are described in more detail below in FIG. 3.

The first row of FIG. 3 describes the computation of the element at position (1, 1) of activation map 210. As shown in the first row, filter 204 is “aligned” with the elements at positions (1, 1), (1, 2), (2, 1) and (2, 2) of input 202. A dot product is computed between filter 204 and these four values of input 202. The dot product is then summed with bias b to arrive at the element at position (1, 1) of activation map 210.

The second row of FIG. 3 describes the computation of the element at position (1, 2) of activation map 210. As shown in the second row, filter 204 is “aligned” with the elements at positions (1, 2), (1, 3), (2, 2) and (2, 3) of input 202. A dot product is computed between filter 204 and these four values of input 202. The dot product is then summed with bias b to arrive at the element at position (1, 2) of activation map 210.

The third row of FIG. 3 describes the computation of the element at position (1, 3) of activation map 210. As shown in the third row, filter 204 is “aligned” with the elements at positions (1, 3), (1, 4), (2, 3) and (2, 4) of input 202. A dot product is computed between filter 204 and these four values of input 202. The dot product is then summed with bias b to arrive at the element at position (1, 3) of activation map 210.

The fourth row of FIG. 3 describes the computation of the element at position (3, 3) of activation map 210. As shown in the fourth row, filter 204 is “aligned” with the elements at positions (3, 3), (3, 4), (4, 3) and (4, 4) of input 202. A dot product is computed between filter 204 and these four values of input 202. The dot product is then summed with bias b to arrive at the element at position (3, 3) of activation map 210. In general, the convolution operation comprises a plurality of shift (or align), dot product and bias (or sum) steps. In the present example, the filter was shifted by 1 spatial position between dot product computations (called the step size or stride), but other step sizes of 2, 3, etc. are possible. Also, while it is possible to add zero padding around input 202, zero padding was not described in FIG. 3 for conciseness and simplicity.
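
To make the shift, dot product and sum steps concrete, the following is a minimal sketch of the 2-dimensional convolution of FIGS. 2-3, written in Python with NumPy. The function and variable names (conv2d, x, w, b) are illustrative assumptions and do not appear in the embodiments described herein.

```python
import numpy as np

def conv2d(x, w, b, stride=1):
    """Slide filter w over input x; at each position compute a dot
    product and add the scalar bias b (no zero padding)."""
    H, W = x.shape                  # input size, e.g., 4x4 as in FIG. 2
    kH, kW = w.shape                # filter size, e.g., 2x2
    oH = (H - kH) // stride + 1     # output height
    oW = (W - kW) // stride + 1     # output width
    y = np.zeros((oH, oW))          # the activation (feature) map
    for i in range(oH):
        for j in range(oW):
            patch = x[i*stride:i*stride+kH, j*stride:j*stride+kW]
            y[i, j] = np.sum(patch * w) + b
    return y

x = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input
w = np.ones((2, 2))                            # 2x2 filter
y = conv2d(x, w, b=0.5)                        # 3x3 activation map
```

With a 4×4 input, a 2×2 filter and a stride of 1, the sketch produces the 3×3 activation map described in FIG. 3.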

FIG. 4 is similar to FIG. 2, except that there are F filters 404, F biases 406 and F activation maps 410 instead of a single filter 204, a single bias 206 and a single activation map 210. The relation between the F filters 404, F biases 406 and F activation maps 410 is as follows. Filter f₁, bias b₁ and input 402 are used to compute activation map y₁ (in very much the same way that filter 204, bias 206 and input 202 were used to compute activation map 210 in FIG. 2); filter f₂, bias b₂ and input 402 are used to compute activation map y₂; and so on.

FIG. 5 is similar to FIG. 2, except that instead of a 2-dimensional input 202 and a 2-dimensional filter 204, there is a 3-dimensional input 502 and a 3-dimensional filter 504. The computations to arrive at activation map 510 are described in more detail below in FIG. 6. While input 502 and filter 504 are 3-dimensional, activation map 510 is still 2-dimensional, as will become clear in the associated description of FIG. 6.

The first row of FIG. 6 describes the computation of the element at position (1, 1) of activation map 510. As shown in the first row, filter 504 is “aligned” with the elements at positions (1, 1, z), (1, 2, z), (2, 1, z) and (2, 2, z) of input 502, where z ∈ {1, 2, 3, 4}. A dot product is computed between filter 504 and these sixteen values of input 502. The dot product is then summed with bias b to arrive at the element at position (1, 1) of activation map 510.

The second row of FIG. 6 describes the computation of the element at position (1, 2) of activation map 510. As shown in the second row, filter 504 is “aligned” with the elements at positions (1, 2, z), (1, 3, z), (2, 2, z) and (2, 3, z) of input 502, where z ∈ {1, 2, 3, 4}. A dot product is computed between filter 504 and these sixteen values of input 502. The dot product is then summed with bias b to arrive at the element at position (1, 2) of activation map 510.

The third row of FIG. 6 describes the computation of the element at position (1, 3) of activation map 510. As shown in the third row, filter 504 is “aligned” with the elements at positions (1, 3, z), (1, 4, z), (2, 3, z) and (2, 4, z) of input 502, where z ∈ {1, 2, 3, 4}. A dot product is computed between filter 504 and these sixteen values of input 502. The dot product is then summed with bias b to arrive at the element at position (1, 3) of activation map 510.

The fourth row of FIG. 6 describes the computation of the element at position (3, 3) of activation map 510. As shown in the fourth row, filter 504 is “aligned” with the elements at positions (3, 3, z), (3, 4, z), (4, 3, z) and (4, 4, z) of input 502, where z ∈ {1, 2, 3, 4}. A dot product is computed between filter 504 and these sixteen values of input 502. The dot product is then summed with bias b to arrive at the element at position (3, 3) of activation map 510. If not already apparent, the depth of filter 504 (or the magnitude of the “z” dimension of filter 504) must match the depth of input 502 (or the magnitude of the “z” dimension of input 502). The depth of input 502 is sometimes called the number of channels of input 502. In the present example of FIG. 5, the depths of input 502 and filter 504 were both four.

FIG. 7 is similar to FIG. 5, except that there are F 3-dimensional filters 704, F biases 706 and F activation maps 710 (F>1), instead of a single 3-dimensional filter 504, a single bias 506 and a single activation map 510. The relation between the F 3-dimensional filters 704, F biases 706 and F activation maps 710 is as follows. Filter f₁, bias b₁ and input 702 are used to compute activation map y₁ (in very much the same way that filter 504, bias 506 and input 502 were used to compute activation map 510 in FIG. 5); filter f₂, bias b₂ and input 702 are used to compute activation map y₂; and so on.
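
The F-filter, 3-dimensional case of FIGS. 5-7 can be sketched by reusing the conv2d routine above: a 3-dimensional convolution is the sum of per-channel 2-dimensional convolutions plus a single bias, and a layer simply stacks F such activation maps. Again, the names conv3d_to_2d and conv_layer are illustrative assumptions, not names used by the embodiments.

```python
import numpy as np

def conv3d_to_2d(x, w, b, stride=1):
    """x: (K, H, W) input with K channels; w: (K, kH, kW) filter.
    The filter depth must match the input depth (FIG. 6)."""
    assert x.shape[0] == w.shape[0], "filter depth must match input depth"
    # Sum the per-channel 2-D convolutions, then add the bias once.
    y = sum(conv2d(x[z], w[z], b=0.0, stride=stride) for z in range(x.shape[0]))
    return y + b

def conv_layer(x, filters, biases, stride=1):
    """filters: (F, K, kH, kW); biases: (F,). Returns F stacked
    2-D activation maps, one per filter (FIG. 7)."""
    return np.stack([conv3d_to_2d(x, f, b, stride)
                     for f, b in zip(filters, biases)])
```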

FIG. 8 depicts computing system 800 within which a neural network may be instantiated. Processor 806 may retrieve input data (similar to input 202 in FIG. 2, input 402 in FIG. 4, etc.) from memory 802. Examples of processor 806 include a general processor, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a microcontroller, etc. Examples of memory 802 include SRAM, DRAM, Flash memory, electrically erasable programmable read-only memory (e.g., EEPROM), etc. Processor 806 may retrieve model parameters (similar to filter 204 and bias 206 in FIG. 2, filter 404 and bias 406 in FIG. 4, etc.) from memory 804. Based on the retrieved input and model parameters, processor 806 may perform convolution computations, rectified linear unit (reLU) computations, max pooling computations, and other computations in order to generate a classification 808 of the input from memory 802.

FIG. 9 depicts convolutional neural network 904, which includes a plurality of layers (or stages), in accordance with one embodiment of the invention. The input with dimensions 512×512×3 may conceptually correspond to input 102 (or 108) depicted in FIG. 1. In the example of FIG. 9, the input may be a portion of a color image consisting of 512 by 512 pixels. The input depth of 3 (i.e., the z-dimension of the input) may represent the red, green, and blue components of the color image.

The output with dimensions 8×8×200 may conceptually correspond to output 106 (or 110) depicted in FIG. 1. In the example of FIG. 9, the first 8×8 matrix (i.e., (x, y, 1) for x ∈ {1, . . . , 8} and y ∈ {1, . . . , 8}) in the output may classify objects in the input. Each classification value in the first 8×8 matrix may indicate a certain object (e.g., 1 for cat, 2 for dog, 3 for human). The position of the classification value in the first 8×8 matrix may provide spatial information of the classified object in the input image (e.g., a value of 1 in element (1, 1) of the first 8×8 matrix may indicate a cat being located in the top left corner of the input image). The depth (i.e., z) dimension of the output may specify attributes associated with the classified object. The attributes may include whether the object is oriented in a vertical manner (e.g., standing position), whether the object is oriented in a horizontal manner (e.g., lying down position), etc.

Convolutional neural network 904 in FIG. 9 may be an example of classification model 104 in FIG. 1. Convolutional neural network 904 may be constructed using a plurality of layers (e.g., convolution layer, reLU layer, max pool layer, LSTM layer, etc.). Note that the particular structure of a given convolutional neural network may vary in terms of the numbers and types of layers, the dimensions of the input and output, the dimensions of the filters, and the like. In the example of FIG. 9, only convolution layers are depicted for simplicity. In layer 1, the input with a dimension of 512×512×3 is convolved with 64 filters and 64 biases to generate an activation map with a dimension of 256×256×64. In layer 2, the input with a dimension of 256×256×64 is convolved with 128 filters and 128 biases to generate an activation map with a dimension of 128×128×128. In layer h, the input with a dimension of 16×16×256 is convolved with 256 filters and 256 biases to generate an activation map with a dimension of 16×16×256. Other layers of convolutional neural network 904 have been omitted for conciseness.

Similar to the example of FIG. 1, the parameters of convolutional neural network 904 are trained during a training phase. The parameters of convolutional neural network 904 include the parameters that are part of the various filters and biases. Once the parameters have been trained, convolutional neural network 904 may be applied to classify the informational content within an input image. It will be more apparent following the quantization and data compression techniques described in the next several figures, but it is noted for clarity that the structure of convolutional neural network 904 depicted in FIG. 9 more specifically corresponds to the convolutional neural network 904 during the training phase. During the inference phase, codewords are retrieved from memory 804, which are decompressed into quantized versions of the 256 filters of layer h, and the 256 quantized filters (along with 256 corresponding biases) are used to perform the convolution operation for layer h.

In battery-powered devices, power is often limited. Power is limited not only due to the finite reservoir of energy stored in the battery, but also because high power use can lead to excessive heating of the device, which is not desirable. In particular, power is consumed each time parameters are retrieved from memory, such as memory 804. One object of the present invention is to improve the functioning of the computing system 800 by reducing its power consumption. The power consumption may be reduced by minimizing the amount of data that is read from (or stored to) memory, such as memory 804. One way to minimize the amount of data that is stored to and/or read from memory 804 is to store and/or retrieve compressed and quantized versions of the filters. For clarity in explanation, the following discussion will focus on quantizing and compressing the 256 filters of layer h, but it is understood that the same techniques may be applied to quantize and compress filters of a different one of the layers.

While minimizing the power usage could be one benefit and/or application of the invention, it is understood that there may be other applications, such as reducing the amount of data that is stored to and retrieved from memory 804 (which has applicability even in a system that is not power-limited), which in turn allows for a reduced size of memory 804. As briefly explained above, the reduced size of memory 804 may allow the memory to be located on the same chip as the convolver/convolution engine (e.g., same chip as processor 806), allowing for a quicker storage and retrieval of the model parameters.

As an overview of the algorithm to compress and quantize the 256 filters of layer h, the 256 filters are first trained via a backpropagation algorithm (or another optimization procedure such as a particle swarm optimization, a genetic algorithm, etc.). For completeness, it is noted that the backpropagation algorithm is applied to optimize the parameters of the 256 filters of layer h and their associated 256 biases (as well as the parameters of filters and biases from other layers) with the objective of minimizing the classification error (i.e., minimizing the likelihood that the predicted classification does not match the correct classification provided in the training data). Stated differently, the loss function of the conventional backpropagation algorithm may include the classification error, and the backpropagation algorithm may minimize this loss function. The backpropagation algorithm is well known in the field and will not be described herein for conciseness. After the 256 filters are trained until a first degree of classification accuracy is reached, the 256 filters are quantized. After the 256 filters are quantized, the 256 quantized filters are compressed. A compressed version of the 256 quantized filters is stored in memory 804. During the model application phase, the 256 compressed and quantized filters are retrieved from memory 804. The 256 compressed and quantized filters are decompressed into 256 quantized filters. The convolution operation from layer h is then performed using the 256 quantized filters (and the associated 256 biases). The storing and retrieval of the 256 biases to/from memory 804 may proceed in a conventional fashion, so these details will not be described for the sake of conciseness.

The quantization, compression and decompression approach is now described in detail with respect to FIGS. 10-15. In FIG. 10, it is assumed that the 256 (or F) filters depicted are the filters which have already been trained by a backpropagation algorithm (or other optimization algorithm) to minimize the classification error. Each of the F filters is then decomposed into K kernels. In the example of FIGS. 9 and 10, each filter is a 3-dimensional object with dimensions 3×3×256 (or 3×3×K). FIG. 10 illustrates how each of the filters can be decomposed into 256 (or K) 3×3 kernels. For example, filter f₁ is decomposed into K kernels, k_(1,j) for j=1 . . . K; filter f₂ is decomposed into K kernels, k_(2,j) for j=1 . . . K; and so on. If one were to visualize a filter as a stick of butter, the K kernels (at least for this particular decomposition) may be visualized as thin slices of butter.
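
The decomposition of FIG. 10 amounts to slicing each filter along its depth axis. A minimal sketch, assuming the F filters are held in a NumPy array of shape (F, 3, 3, K) with the depth (channel) axis last; the actual storage layout is an implementation detail.

```python
import numpy as np

F, K = 256, 256
filters = np.random.randn(F, 3, 3, K)   # F filters, each 3x3xK (placeholder values)

# Kernel k_(i,j) is the j-th 3x3 depth slice of filter f_i.
kernels = filters.transpose(0, 3, 1, 2).reshape(F * K, 3, 3)

# Each 3x3 kernel has 9 parameters, so it is one point in a 9-dimensional space.
points = kernels.reshape(F * K, 9)
```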

It is noted that the decomposition of kernels illustrated in FIG. 10 is only an example, and a filter can be decomposed into kernels in a different fashion. As another example (not depicted), each filter may be constructed from nine kernels, in which kernel 1 is constructed as the vector with elements at the positions (1, 1, z) of the respective filter, where z ∈ {1, . . . , 256}; kernel 2 is constructed as the vector with elements at the positions (1, 2, z) of the respective filter, where z ∈ {1, . . . , 256}; . . . and kernel 9 is constructed as the vector with elements at the positions (3, 3, z) of the respective filter, where z ∈ {1, . . . , 256}. If one were to visualize the filter as a stick of butter, the nine kernels (at least for this particular decomposition) may be visualized as nine long skinny rods of butter.

Returning to the depicted decomposition in which each of the F filters is decomposed into K 3×3 kernels, each of the F×K kernels may be visualized as a point in a 9-dimensional space. The reason for a 9-dimensional space is that a 3×3 kernel has 9 parameters (or components). Each of the axes of the 9-dimensional space represents a component of the kernel. If 4×4 kernels were used, the kernels could be visualized as points in a 16-dimensional space.

In FIG. 11, the F×K kernels (each kernel being a 3×3 matrix) are clustered into C clusters using the k-means algorithm or other clustering algorithm. For clarity, it is noted that no association between the k in “k”-means and the variable k used to denote a kernel is intended. The k-means algorithm is well known in the art and will not be described herein for conciseness. For simplicity in explanation, assume that C=8 (i.e., there are 8 clusters). For ease of depiction, only 3 of the 8 clusters have been depicted. The 8 clusters may be defined by Voronoi regions (i.e., partitions of the 9-dimensional space, each of which contains a respective cluster of the kernels). Voronoi region 1102 corresponds to cluster 0; Voronoi region 1104 corresponds to cluster 1; Voronoi region 1106 corresponds to cluster 2; and so on for the remaining Voronoi regions which are not depicted. In one embodiment, C is chosen to be a power of 2 (i.e., C=2^n, where n is a natural number) for reasons that will become clearer later on (i.e., this allows for efficient use of bits to store centroid identifiers).
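
A sketch of the clustering step, continuing the points array from the previous sketch and using scikit-learn's KMeans as one possible k-means implementation (the embodiments do not mandate a particular library):

```python
from sklearn.cluster import KMeans

C = 8                                    # number of clusters (2^3, as in the example)
km = KMeans(n_clusters=C, n_init=10).fit(points)   # points: (F*K, 9)

centroids = km.cluster_centers_          # (C, 9): center of mass of each cluster
assignments = km.labels_                 # (F*K,): index of each kernel's cluster
```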

A centroid is associated with each of the clusters, and may be computed as the center of mass of the kernels within the cluster. If the centroid of a cluster were represented as a vector of numbers, the first component of the centroid would be computed as the average of the first components of all the kernels within the cluster, the second component of the centroid would be computed as the average of the second components of all the kernels within the cluster, etc. Three centroids have been drawn in FIG. 11 as black dots. Centroid c₀ corresponds to cluster 0; centroid c₁ corresponds to cluster 1; centroid c₂ corresponds to cluster 2; and so on for the remaining centroids which are not depicted.

One way to quantize the F×K kernels would be to, at this point in the approach, set each of the kernels equal to its corresponding centroid (i.e., the centroid of the Voronoi region within which the kernel is located). While such a quantization is possible, the proposed approach performs a further optimization step (described in FIG. 12) before such a quantization step is performed.

FIG. 12 describes the intuition behind an optimization process in which the classification error and the deviation between each of the kernels and its closest centroid (called the “clustering error” for brevity) are jointly minimized. The loss function may be formulated as a linear combination of the classification error and the clustering error (or another mathematical function which combines the classification error and the clustering error). The backpropagation algorithm (or other optimization algorithm) may be applied to minimize the instant loss function. For clarity, it is noted that the backpropagation algorithm is applied to minimize a first loss function, thereby arriving at the filters (and corresponding kernels) depicted in FIG. 11; centroids are computed using the k-means algorithm (as depicted in FIG. 11); and a subsequent backpropagation algorithm is applied to minimize a second loss function (described in association with FIG. 12). Therefore, there may be two stages of backpropagation: one prior to FIG. 11 and one during FIG. 12.
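
In symbols, the second loss function may be written as shown below, where L_cls is the classification error and the second term is the clustering error. The mixing weight λ and the squared Euclidean distance are illustrative assumptions; the description requires only a linear combination of the two errors.

```latex
L_2 = L_{\mathrm{cls}} + \lambda \sum_{i=1}^{F \times K} \min_{0 \le j < C} \lVert k_i - c_j \rVert^2
```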

FIG. 12 describes how, during the iterations of the second backpropagation stage, one may visualize the points (representing kernels) as moving in the 9-dimensional space. If the loss function only accounted for the clustering error, each point would migrate towards its closest centroid. However, since the loss function contains the classification error in addition to the clustering error, some points may actually move toward a centroid other than the closest centroid. For simplicity of depiction and explanation, only two of the F×K kernels are depicted in FIG. 12. Kernel k_(1,K) is depicted as moving toward centroid c₁ (its closest centroid), while kernel k_(F,K) is depicted as moving toward centroid c₀, even though its closest centroid, at least initially, is centroid c₂. The intuition is that the classification error would be greater if kernel k_(F,K) were set equal to centroid c₂ than if kernel k_(F,K) were set equal to centroid c₀, and so this is why k_(F,K) moves towards c₀. On the other hand, to minimize the clustering error, it does not matter which centroid a kernel migrates towards, as long as at the conclusion of the optimization process the kernel is located close to a centroid. Stated differently, the clustering error component in the loss function is responsible for steering kernels towards one of the centroids, and the classification error component can be thought of as influencing which centroid a kernel migrates towards.

The second backpropagation stage is performed until a second classification accuracy is reached. The second classification accuracy associated with the second stage of backpropagation is not necessarily equal to the first classification accuracy. Subsequently, the F×K kernels are quantized by setting each kernel equal to its closest centroid (where “closest” may be understood in terms of Euclidean distance). In the example of FIG. 12, kernel k_(1,K) is set equal to centroid c₁, and kernel k_(F,K) is set equal to centroid c₀.

FIG. 13A summarizes the above-described steps of training and quantizing the F filters of layer h, in addition to describing an additional compression step. At the start of the model training process, the convolutional neural network is trained using a first loss function until a first classification accuracy is reached. The first row of filters represents the F filters of layer h at the conclusion of the backpropagation algorithm. Then, the convolutional neural network is trained using a second loss function until a second classification accuracy is reached. This second step of training is noted as “joint optimization” in FIG. 13A, referring to the second loss function formed as a linear combination of the classification error and the clustering error. The second row of filters represents the F filters of layer h at the conclusion of the joint optimization process. The F filters of layer h are then quantized. The number of parameters of each of the kernels of the filters remains unchanged during the quantization process, as shown in the unchanged geometric depiction of a kernel at the conclusion of the quantization process. However, during the quantization process, each kernel is constrained to be one of the C centroids, which is why each of the kernels has been labeled with a centroid. It is understood that the particular labeling of centroids (k_(1,1)=c₃, k_(1,2)=c₅, . . . ) is provided as an example only. The F quantized filters may then be compressed by substituting each kernel with an identifier of the centroid that the kernel equals. For example, c₃ is substituted with 3, c₅ is substituted with 5, and so on. A dictionary may be relied upon to perform this substitution. FIG. 14 depicts an example of a dictionary that may be used to map centroid identifiers (also called codewords) to centroids. Finally, the centroid identifiers (or codewords) may be stored in memory 804. While the quantization and compression of kernels was described as two sequential steps in FIG. 13A, it is possible for these two steps to be performed in a single step (i.e., replace each kernel at the conclusion of the joint optimization process with a centroid identifier which identifies the centroid that is closest to the respective kernel), as depicted in FIG. 13B.
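
A sketch of the single-step quantization and compression of FIG. 13B, continuing the arrays from the earlier sketches: each kernel is replaced by the identifier of its nearest centroid, and the dictionary of FIG. 14 is simply the table of centroids indexed by those identifiers.

```python
import numpy as np

# Distance from every kernel (F*K, 9) to every centroid (C, 9).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
codewords = np.argmin(dists, axis=1)      # (F*K,) centroid identifiers

# The dictionary of FIG. 14: centroid identifier -> 3x3 centroid.
dictionary = {i: c.reshape(3, 3) for i, c in enumerate(centroids)}

# The F*K codewords (3 bits each when C=8) and the dictionary are what
# would be stored in memory 804.
```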

At this point, a numerical comparison is presented to demonstrate the reduction in bits that are needed to store the F compressed and quantized filters, as compared to the F “original” filters (top row of FIG. 13A or 13B), as would be the case in a conventional approach. Assume that there are 256 filters (i.e., F=256), and each of the filters has a dimension of 3×3×256. Further assume that each parameter of a filter is stored using 8 bits. Therefore, in a conventional approach, storing the 256 filters from layer h would require storing 3×3×256×256×8 bits = 4.7 million bits.

In the approach outlined in FIG. 13A (or FIG. 13B), assume that each filter is decomposed into 256 kernels (i.e., K=256). Further assume that the kernels are quantized into 8 clusters. As a result of the number of clusters being 8, each of the codewords can be stored using 3 bits (i.e., log₂(8)=3). Therefore, in the approach of FIG. 13A (or FIG. 13B), storing the 256 filters from layer h would require storing F×K×3 bits = 256×256×3 = 196,608 bits. For a complete comparison, it is noted that the approach of FIG. 13A (or FIG. 13B) also requires storing a dictionary, which essentially requires the storing of 8 centroids, or 8×3×3×8 bits (i.e., storing the right column of FIG. 14) = 576 bits. Therefore, the total number of bits stored for the 256 filters of layer h using quantization and compression is 196,608 bits + 576 bits = 197,184 bits. Therefore, in this particular example for the 256 filters of layer h, there is approximately a 96% reduction in the number of bits that are stored in the approach of FIG. 13A (or FIG. 13B), which improves the functioning of the computing system 800, as compared to a conventional approach.
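
The comparison can be verified with a few lines of arithmetic:

```python
F, K, C, bits_per_param = 256, 256, 8, 8

conventional = 3 * 3 * K * F * bits_per_param   # 4,718,592 bits (~4.7 million)
codeword_bits = F * K * 3                       # 196,608 bits (3 = log2(8))
dictionary_bits = C * 3 * 3 * bits_per_param    # 576 bits (8 centroids of 9 parameters)
compressed = codeword_bits + dictionary_bits    # 197,184 bits

print(1 - compressed / conventional)            # ~0.958, i.e., roughly a 96% reduction
```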

FIG. 15 describes a decompression operation performed during the model application phase. For layer h, the previously described F×K codewords are retrieved from memory 804, and decompressed using the example dictionary in FIG. 14. The decompression operation involves substituting centroids in place of centroid identifiers (or codewords). The decompression step produces F quantized filters, which are then used (along with F biases, not depicted) to perform the above-described convolution operation. It is understood that the same efficiencies described above for the storing of quantized and compressed data in memory 804 also apply to the reading of the quantized and compressed filter data from memory 804.
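
A sketch of the decompression step, inverting the compression sketch above: each retrieved codeword is looked up in the dictionary, and the resulting quantized kernels are reassembled into F quantized filters (using the same (F, 3, 3, K) layout assumed earlier).

```python
import numpy as np

# Substitute the 3x3 centroid for each of the F*K codewords.
quantized_kernels = np.stack([dictionary[int(c)] for c in codewords])  # (F*K, 3, 3)

# Reassemble into F quantized filters of shape (3, 3, K) each.
quantized_filters = quantized_kernels.reshape(F, K, 3, 3).transpose(0, 2, 3, 1)
```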

FIG. 16 depicts flowchart 1600 with an overview of the model training and model application of a convolutional neural network, in accordance with one embodiment of the invention. At step 1602, a convolutional neural network is instantiated on computing system 800. The convolutional neural network may include a plurality of layers, including, in certain embodiments, one or more of a convolution layer, a reLU layer and a max pooling layer. In step 1604, the convolutional neural network may be used to classify data received by the convolutional neural network.

FIG. 17 depicts flowchart 1700 of a process performed during the training of a convolutional neural network, in accordance with one embodiment of the invention. At step 1702, processor 806 may train the convolutional neural network using a first loss function until a first classification accuracy is reached. A first one of the layers of the convolutional neural network, lₕ, h ∈ {1, . . . , L}, may include F filters (each filter formed by K kernels) and F biases. For brevity, “a first one of the layers” will be referred to as a “first layer”, but it is understood that the “first layer” is meant to refer to one of the layers, not necessarily the very first layer. At step 1704, processor 806 may cluster the F×K kernels from the first layer, lₕ, into C clusters, each of the clusters characterized by a centroid. As explained above, a centroid of a cluster may be calculated as the center of mass of the kernels within the cluster. At step 1706, processor 806 may train the convolutional neural network using a second loss function until a second classification accuracy is reached. At step 1708, processor 806 may create a dictionary which maps each of the centroids to a corresponding centroid identifier. An example of such a dictionary is depicted in FIG. 14. At step 1710, processor 806 may quantize and compress the F filters of the first layer, lₕ, by, for each of the F×K kernels, replacing the kernel with a centroid identifier that identifies a centroid that is closest to the kernel. At step 1712, processor 806 may store the F quantized and compressed filters of the first layer, lₕ, in memory 804. The F quantized and compressed filters may comprise F×K centroid identifiers (or F×K codewords). At step 1714, processor 806 may store the F biases of the first layer, lₕ, in memory 804.

FIG. 18 depicts flowchart 1800 of a process performed during the application of a convolutional neural network, in accordance with one embodiment of the invention. At step 1802, processor 806 may retrieve the F quantized and compressed filters of the first layer, lₕ, from memory 804. At step 1804, processor 806 may decompress, using the dictionary, the F quantized and compressed filters of the first layer, lₕ, into F quantized filters. At step 1806, processor 806 may retrieve the F biases of the first layer, lₕ, from memory 804. At step 1808, processor 806 may, for the first layer, lₕ, compute a convolution of the received data or data output from a previous layer with the F quantized filters and the F biases. If the first one of the layers, lₕ, is for h=1, the convolution may be computed with the data from memory 802 (e.g., the 512×512×3 input depicted in FIG. 9). Otherwise, the convolution may be computed with the data from a previous layer. For example, for layer 2 in FIG. 9, the convolution would be computed with the 256×256×64 activation map generated from layer 1, the 128 filters and the 128 biases.

It is noted that while the compression and quantization of filters has been described primarily in association with the convolution operation, it is understood that such algorithms may have similar applicability for other operations in a neural network involving the retrieval of filters (e.g., the LSTM operation). More generally, such algorithms may be applied to recurrent networks (e.g., normal LSTM or recurrent neural networks), which also utilize filters (e.g., mostly 1-dimensional and over time, and not over “space” (e.g., an image)). Such algorithms may additionally be applied to 1-dimensional convolutional and also 3-dimensional convolutional networks, in which a 3-dimensional convolution layer could be used to process image-time data where the third dimension would be the time dimension (in other words, the data would be video data). Even more generally, such algorithms may be applied to groups of parameters, as is present in a normal non-convolutional LSTM cell, in which every cell is described by a group of parameters. The grouping of parameters in a machine learning (ML) model could be designed in many possible ways.

FIG. 19 depicts components of a computer system in which computer readable instructions instantiating the methods of the present invention may be stored and executed. As is apparent from the foregoing discussion, aspects of the present invention involve the use of various computer systems and computer readable storage media having computer-readable instructions stored thereon. FIG. 19 provides an example of a system 1900 that may be representative of any of the computing systems (e.g., computing system 800) discussed herein. Examples of system 1900 may include a smartphone, a desktop, a laptop, a mainframe computer, an embedded system, etc. Note, not all of the various computer systems have all of the features of system 1900. For example, certain ones of the computer systems discussed above may not include a display inasmuch as the display function may be provided by a client computer communicatively coupled to the computer system or a display function may be unnecessary. Such details are not critical to the present invention.

System 1900 includes a bus 1902 or other communication mechanism for communicating information, and a processor 1904 coupled with the bus 1902 for processing information. Computer system 1900 also includes a main memory 1906, such as a random-access memory (RAM) or other dynamic storage device, coupled to the bus 1902 for storing information and instructions to be executed by processor 1904. Main memory 1906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1904. Computer system 1900 further includes a read only memory (ROM) 1908 or other static storage device coupled to the bus 1902 for storing static information and instructions for the processor 1904. A storage device 1910, for example a hard disk, flash memory-based storage medium, or other storage medium from which processor 1904 can read, is provided and coupled to the bus 1902 for storing information and instructions (e.g., operating systems, applications programs and the like).

Computer system 1900 may be coupled via the bus 1902 to a display 1912, such as a flat panel display, for displaying information to a computer user. An input device 1914, such as a keyboard including alphanumeric and other keys, may be coupled to the bus 1902 for communicating information and command selections to the processor 1904. Another type of user input device is cursor control device 1916, such as a mouse, a trackpad, or similar input device for communicating direction information and command selections to processor 1904 and for controlling cursor movement on the display 1912. Other user interface devices, such as microphones, speakers, etc. are not shown in detail but may be involved with the receipt of user input and/or presentation of output.

The processes referred to herein may be implemented by processor 1904 executing appropriate sequences of computer-readable instructions contained in main memory 1906. Such instructions may be read into main memory 1906 from another computer-readable medium, such as storage device 1910, and execution of the sequences of instructions contained in the main memory 1906 causes the processor 1904 to perform the associated actions. In alternative embodiments, hard-wired circuitry or firmware-controlled processing units may be used in place of or in combination with processor 1904 and its associated computer software instructions to implement the invention. The computer-readable instructions may be rendered in any computer language.

In general, all of the above process descriptions are meant to encompass any series of logical steps performed in a sequence to accomplish a given purpose, which is the hallmark of any computer-executable application. Unless specifically stated otherwise, it should be appreciated that throughout the description of the present invention, use of terms such as “processing”, “computing”, “calculating”, “determining”, “displaying”, “receiving”, “transmitting” or the like, refer to the action and processes of an appropriately programmed computer system, such as computer system 1900 or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within its registers and memories into other data similarly represented as physical quantities within its memories or registers or other such information storage, transmission or display devices.

Computer system 1900 also includes a communication interface 1918 coupled to the bus 1902. Communication interface 1918 may provide a two-way data communication channel with a computer network, which provides connectivity to and among the various computer systems discussed above. For example, communication interface 1918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, which itself is communicatively coupled to the Internet through one or more Internet service provider networks. The precise details of such communication paths are not critical to the present invention. What is important is that computer system 1900 can send and receive messages and data through the communication interface 1918 and in that way communicate with hosts accessible via the Internet. It is noted that the components of system 1900 may be located in a single device or located in a plurality of physically and/or geographically distributed devices.

Thus, cluster compression for compressing weights in neural networks has been described. It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
 1. A method, comprising: instantiating a neural network on a computing system, the neural network including a plurality of layers, wherein instantiating the neural network comprises: training the neural network using a loss function until a classification accuracy is reached, wherein the loss function calculates a classification error of the neural network, wherein training the neural network with the loss function comprises optimizing, for a first one of the layers, a set of F filters and a set of F biases so as to minimize the loss function, wherein each of the F filters is formed from K kernels, wherein K and F are each greater than one, wherein each of the kernels comprises nine parameters and wherein each of the biases is scalar; clustering the set of F×K kernels of the first layer into a set of C clusters, wherein each of the clusters is characterized by a centroid, thereby the C clusters being characterized by C centroids, wherein each of the centroids comprises nine parameters, and wherein C is less than F×K; creating a dictionary which maps each of the centroids to a corresponding scalar centroid identifier; quantizing and compressing the F filters of the first layer by, for each of the F×K kernels, replacing the nine parameters of the kernel with one of the scalar centroid identifiers from the dictionary; storing the F quantized and compressed filters of the first layer in a memory of the computing system, the F quantized and compressed filters comprising F×K scalar centroid identifiers; and storing the F biases of the first layer in the memory; and classifying data received by the neural network, wherein the classification comprises: retrieving the F quantized and compressed filters of the first layer from the memory, the F quantized and compressed filters comprising the F×K scalar centroid identifiers; decompressing, using the dictionary, the F quantized and compressed filters of the first layer into F quantized filters by mapping the F×K scalar centroid identifiers into F×K corresponding quantized kernels, the F×K corresponding quantized kernels each comprising nine parameters and forming the F quantized filters; retrieving the F biases of the first layer from the memory; and for the first layer, processing the received data or data output from a layer previous to the first layer with the F quantized filters and the F biases, wherein a number of channels of the received data or data output from the layer previous to the first layer is equal to K.
 2. The method of claim 1, further comprising: during the instantiation of the neural network, further storing the dictionary in the memory; and during the classification of the received data, further retrieving the dictionary from the memory.
 3. The method of claim 1, wherein C is equal to 2^n, where n is a natural number.
 4. The method of claim 3, wherein each of the scalar centroid identifiers is expressed using n bits.
 5. The method of claim 1, wherein the F×K kernels are clustered into the C clusters using the k-means algorithm.
 6. The method of claim 1, wherein each of the F×K kernels is represented by a 3×3 matrix of parameters.